Abstract

We show that parametric models trained by a stochastic gradient method (SGM) with few 
iterations have vanishing generalization error. We prove our results by arguing that SGM 
is algorithmically stable in the sense of Bousquet and Elisseeff. Our analysis only employs 
elementary tools from convex and continuous optimization. We derive stability bounds for both 
convex and non-convex optimization under standard Lipschitz and smoothness assumptions. 

Applying our results to the convex case, we provide new insights for why multiple epochs of 
stochastic gradient methods generalize well in practice. In the non-convex case, we give a new 
interpretation of common practices in neural networks, and formally show that popular techniques
for training large deep models are indeed stability-promoting. Our findings conceptually
underscore the importance of reducing training time beyond its obvious benefit. 


1 Introduction 

The most widely used optimization method in machine learning practice is the stochastic gradient
method (SGM). Stochastic gradient methods aim to minimize the empirical risk of a model by
repeatedly computing the gradient of a loss function on a single training example, or a batch of few 
examples, and updating the model parameters accordingly. SGM is scalable, robust, and performs 
well across many different domains ranging from smooth and strongly convex problems to complex 
non-convex objectives. 

In a nutshell, our results establish that: 

Any model trained with the stochastic gradient method in a reasonable
amount of time attains small generalization error.

As training time is inevitably limited in practice, our results help to explain the strong
generalization performance of stochastic gradient methods observed in practice. More concretely, we
bound the generalization error of a model in terms of the number of iterations that the stochastic
gradient method took in order to train the model. Our main analysis tool is the notion of
algorithmic stability due to Bousquet and Elisseeff. We demonstrate that the stochastic gradient
method is stable provided that the objective is relatively smooth and the number of steps taken is
sufficiently small.

It is common in practice to perform a linear number of steps in the size of the sample and to 
access each data point multiple times. Our results show in a broad range of settings that, provided 
the number of iterations is linear in the number of data points, the generalization error is bounded 
by a vanishing function of the sample size. The results hold true even for complex models with a large
number of parameters and no explicit regularization term in the objective. Namely, fast training
time by itself is sufficient to prevent overfitting.

Our bounds are algorithm specific: Since the number of iterations we allow can be larger than 
the sample size, an arbitrary algorithm could easily achieve small training error by memorizing 
all training data with no generalization ability whatsoever. In contrast, if the stochastic gradient 
method manages to fit the training data in a reasonable number of iterations, it is guaranteed to 
generalize. 

Conceptually, we show that minimizing training time is not only beneficial for obvious
computational advantages, but also has the important byproduct of decreasing generalization error.
Consequently, it may make sense for practitioners to focus on minimizing training time, for
instance, by designing model architectures for which the stochastic gradient method converges fastest to
a desired error level.

1.1 Our contributions 

Our focus is on generating generalization bounds for models learned with stochastic gradient descent. 
Recall that the generalization bound is the expected difference between the error a model incurs on 
a training set versus the error incurred on a new data point, sampled from the same distribution 
that generated the training data. Throughout, we assume we are training models using n sampled 
data points. 

Our results build on a fundamental connection between the generalization error of an algorithm
and its stability properties. Roughly speaking, an algorithm is stable if the training error it achieves
varies only slightly if we change any single training data point. The precise notion of stability we
use is known as uniform stability, due to Bousquet and Elisseeff. It states that a randomized algorithm A is uniformly
stable if for all data sets differing in only one element, the learned models produce nearly the same
predictions. We review this method in Section 2 and provide a new adaptation of this theory to
iterative algorithms.

In Section 3 we show that stochastic gradient is uniformly stable, and our techniques mimic
its convergence proofs. For convex loss functions, we prove that the stability measure decreases as
a function of the sum of the step sizes. For strongly convex loss functions, we show that stochastic
gradient is stable, even if we train for an arbitrarily long time. We can combine our bounds on
the generalization error of the stochastic gradient method with optimization bounds quantifying the
convergence of the empirical loss achieved by SGM. We also show that models trained for
multiple epochs match classic bounds for stochastic gradient.

More surprisingly, our results carry over to the case where the loss function is non-convex. In
this case we show that the method generalizes provided the steps are sufficiently small and the
number of iterations is not too large. More specifically, we show the number of steps of stochastic
gradient can grow as $n^c$ for a small $c > 1$. This provides some explanation as to why neural networks
can be trained for multiple epochs of stochastic gradient and still exhibit excellent generalization.
In Section 4 we furthermore show that various heuristics used in practice, especially in the deep
learning community, help to increase the stability of the stochastic gradient method. For example, the
popular dropout scheme [40] improves all of our bounds. Similarly, $\ell_2$-regularization improves
the exponent of $n$ in our non-convex result. In fact, we can drive the exponent arbitrarily close to
1/2 while preserving the non-convexity of the problem.


1.2 Related work 

There is a venerable line of work on stability and generalization dating back more than thirty
years [8, 18, 26, 39]. The landmark work by Bousquet and Elisseeff introduced the notion of
uniform stability that we rely on. They showed that several important classification techniques
are uniformly stable. In particular, under certain regularity assumptions, it was shown that the
optimizer of a regularized empirical loss minimization problem is uniformly stable. Previous work
generally applies only to the exact minimizer of specific optimization problems. It is not
immediately evident how to compute a generalization bound for an approximate minimizer such as
one found by using stochastic gradient. Subsequent work studied stability bounds for randomized
algorithms but focused on random perturbations of the cost function, such as those induced by
bootstrapping or bagging. This manuscript differs from this foundational work in that it
derives stability bounds about the learning procedure, analyzing algorithmic properties that induce
stability.

Stochastic gradient descent, of course, is closely related to our inquiry. Classic results by
Nemirovski and Yudin show that the stochastic gradient method is nearly optimal for
empirical risk minimization of convex loss functions. These results have been extended by
many machine learning researchers, yielding tighter bounds and probabilistic guarantees.
However, there is an important limitation of all of this prior art. The derived generalization bounds
only hold for single passes over the data. That is, in order for the bounds to be valid, each training
example must be used no more than once in a stochastic gradient update. In practice, of course,
one tends to run multiple epochs of the stochastic gradient method. Our results resolve this issue by
combining stability with optimization error. We use the foundational results to estimate the error
on the empirical risk and then use stability to derive a deviation from the true risk. This enables us
to study the risk incurred by multiple epochs and provide simple analyses of regularization methods
for convex stochastic gradient. We compare our results to this related work later in the paper. We note
that Rosasco and Villa obtain risk bounds for least squares minimization with an incremental
gradient method in terms of the number of epochs [37]. These bounds are akin to our multi-epoch
analysis, although our results are incomparable due to differing assumptions.

Finally, we note that in the non-convex case, the stochastic gradient method is remarkably
successful for training large neural networks. However, our theoretical understanding of
this method is limited. Several authors have shown that the stochastic gradient method finds a
stationary point of nonconvex cost functions [12, 21]. Beyond asymptotic convergence to stationary
points, little is known about finding models with low training or generalization error in the
nonconvex case. There have recently been several important studies investigating optimal training
of neural nets. For example, Livni et al. show that networks with polynomial activations can be
learned in a greedy fashion [24]. Janzamin et al. show that two-layer neural networks can be
learned using tensor methods. Arora et al. show that two-layer sparse coding dictionaries can be
learned via stochastic gradient. Our work complements these developments: rather than providing
new insights into mechanisms that yield low training error, we provide insights into mechanisms
that yield low generalization error. If one can achieve low training error quickly on a nonconvex
problem with stochastic gradient, our results guarantee that the resulting model generalizes well.


2 Stability of randomized iterative algorithms 


Consider the following general setting of supervised learning. There is an unknown distribution $\mathcal{D}$
over examples from some space $Z$. We receive a sample $S = (z_1, \dots, z_n)$ of $n$ examples drawn
i.i.d. from $\mathcal{D}$. Our goal is to find a model $w$ with small population risk, defined as:

\[ R[w] = \mathbb{E}_{z \sim \mathcal{D}}\, f(w; z) . \]

Here, $f$ is a loss function and $f(w; z)$ designates the loss of the model described by $w$
encountered on example $z$.

Since we cannot measure the objective $R[w]$ directly, we instead use a sample-averaged proxy,
the empirical risk, defined as

\[ R_S[w] = \frac{1}{n} \sum_{i=1}^{n} f(w; z_i) . \]

The generalization error of a model $w$ is the difference

\[ R_S[w] - R[w] . \tag{2.1} \]

When $w = A(S)$ is chosen as a function of the data by a potentially randomized algorithm $A$, it
makes sense to consider the expected generalization error

\[ \epsilon_{\mathrm{gen}} = \mathbb{E}_{S,A}\big[ R_S[A(S)] - R[A(S)] \big] , \tag{2.2} \]

where the expectation is over the randomness of $A$ and the sample $S$.

In order to bound the generalization error of an algorithm, we employ the following notion of 
uniform stability in which we allow randomized algorithms as well. 

Definition 2.1. A randomized algorithm $A$ is $\epsilon$-uniformly stable if for all data sets $S, S' \in Z^n$
such that $S$ and $S'$ differ in at most one example, we have

\[ \sup_{z} \mathbb{E}_A \big[ f(A(S); z) - f(A(S'); z) \big] \le \epsilon . \tag{2.3} \]

Here, the expectation is taken only over the internal randomness of $A$. We will denote by $\epsilon_{\mathrm{stab}}(A, n)$
the infimum over all $\epsilon$ for which (2.3) holds. We will omit the tuple $(A, n)$ when it is clear from
the context.


We recall the important theorem that uniform stability implies generalization in expectation.
Since our notion of stability differs slightly from existing ones with respect to the randomness of
the algorithm, we include a proof for the sake of completeness. The proof follows a standard
argument from the stability literature.

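Theorem 2.2. [Generalization in expectation] Let $A$ be $\epsilon$-uniformly stable. Then,

\[ \Big| \mathbb{E}_{S,A}\big[ R_S[A(S)] - R[A(S)] \big] \Big| \le \epsilon . \]

Proof. Denote by $S = (z_1, \dots, z_n)$ and $S' = (z'_1, \dots, z'_n)$ two independent random samples and let
$S^{(i)} = (z_1, \dots, z_{i-1}, z'_i, z_{i+1}, \dots, z_n)$ be the sample that is identical to $S$ except in the $i$'th example
where we replace $z_i$ with $z'_i$. With this notation, we get that

\begin{align*}
\mathbb{E}_S \mathbb{E}_A \big[ R_S[A(S)] \big]
&= \mathbb{E}_S \mathbb{E}_A \Big[ \frac{1}{n} \sum_{i=1}^{n} f(A(S); z_i) \Big] \\
&= \mathbb{E}_S \mathbb{E}_{S'} \mathbb{E}_A \Big[ \frac{1}{n} \sum_{i=1}^{n} f(A(S^{(i)}); z'_i) \Big] \\
&= \mathbb{E}_S \mathbb{E}_{S'} \mathbb{E}_A \Big[ \frac{1}{n} \sum_{i=1}^{n} f(A(S); z'_i) \Big] + \delta \\
&= \mathbb{E}_S \mathbb{E}_A \big[ R[A(S)] \big] + \delta ,
\end{align*}

where we can express $\delta$ as

\[ \delta = \mathbb{E}_S \mathbb{E}_{S'} \mathbb{E}_A \Big[ \frac{1}{n} \sum_{i=1}^{n} f(A(S^{(i)}); z'_i) - \frac{1}{n} \sum_{i=1}^{n} f(A(S); z'_i) \Big] . \]

Furthermore, taking the supremum over any two data sets $S, S'$ differing in only one sample, we
can bound the difference as

\[ |\delta| \le \sup_{S, S', z} \mathbb{E}_A \big[ f(A(S); z) - f(A(S'); z) \big] \le \epsilon , \]

by our assumption on the uniform stability of $A$. The claim follows.
□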


Theorem 2.2 proves that if an algorithm is uniformly stable, then its generalization error is 
small. We now turn to some properties of iterative algorithms that control their uniform stability. 


2.1 Properties of update rules 

We consider general update rules of the form $G\colon \Omega \to \Omega$ which map a point $w \in \Omega$ in the parameter
space to another point $G(w)$. The most common update is the gradient update rule

\[ G(w) = w - \alpha \nabla f(w) , \]

where $\alpha \ge 0$ is a step size and $f\colon \Omega \to \mathbb{R}$ is a function that we want to optimize.

The canonical update rule we will consider in this manuscript is an incremental gradient update,
where $G(w) = w - \alpha \nabla f(w)$ for some convex function $f$. We will return to a detailed discussion
of this specific update in the sequel, but the reader should keep this particular example in mind
throughout the remainder of this section.

The following two definitions provide the foundation of our analysis of how two different
sequences of update rules diverge when iterated from the same starting point. These definitions will
ultimately be useful when analyzing the stability of stochastic gradient descent.


Definition 2.3. An update rule is $\eta$-expansive if

\[ \sup_{v, w \in \Omega} \frac{\|G(v) - G(w)\|}{\|v - w\|} \le \eta . \tag{2.4} \]

Definition 2.4. An update rule is $\sigma$-bounded if

\[ \sup_{w \in \Omega} \|w - G(w)\| \le \sigma . \tag{2.5} \]

With these two properties, we can establish the following lemma on how a sequence of updates
to a model diverges when the training set is perturbed.

Lemma 2.5 (Growth recursion). Fix an arbitrary sequence of updates $G_1, \dots, G_T$ and another
sequence $G'_1, \dots, G'_T$. Let $w_0 = w'_0$ be a starting point in $\Omega$ and define $\delta_t = \|w'_t - w_t\|$ where $w_t, w'_t$
are defined recursively through

\[ w_{t+1} = G_t(w_t), \qquad w'_{t+1} = G'_t(w'_t) \qquad (t > 0) . \]

Then, we have the recurrence relation

\[ \delta_0 = 0, \qquad
\delta_{t+1} \le
\begin{cases}
\eta\, \delta_t & G_t = G'_t \text{ is } \eta\text{-expansive} \\
\min(\eta, 1)\, \delta_t + 2\sigma_t & G_t \text{ and } G'_t \text{ are } \sigma\text{-bounded, } G_t \text{ is } \eta\text{-expansive}
\end{cases}
\qquad (t > 0) . \]


Proof. The first bound on $\delta_t$ follows directly from the assumption that $G_t = G'_t$ and the definition of
expansiveness. For the second bound, recall from Definition 2.4 that if $G_t$ and $G'_t$ are $\sigma$-bounded,
then by the triangle inequality,

\begin{align*}
\delta_{t+1} &= \|G(w_t) - G'(w'_t)\| \\
&\le \|G(w_t) - w_t + w'_t - G'(w'_t)\| + \|w_t - w'_t\| \\
&\le \delta_t + \|G(w_t) - w_t\| + \|G'(w'_t) - w'_t\| \\
&\le \delta_t + 2\sigma ,
\end{align*}

which gives half of the second bound. We can alternatively bound $\delta_{t+1}$ as

\begin{align*}
\delta_{t+1} &= \|G_t(w_t) - G'_t(w'_t)\| \\
&= \|G_t(w_t) - G_t(w'_t) + G_t(w'_t) - G'_t(w'_t)\| \\
&\le \|G_t(w_t) - G_t(w'_t)\| + \|G_t(w'_t) - G'_t(w'_t)\| \\
&\le \|G_t(w_t) - G_t(w'_t)\| + \|w'_t - G_t(w'_t)\| + \|w'_t - G'_t(w'_t)\| \\
&\le \eta\, \delta_t + 2\sigma .
\end{align*}
□
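The recurrence of Lemma 2.5 can be iterated mechanically. The following minimal Python sketch does so under assumed inputs: the function name `growth_recursion_bound`, the boolean flags marking steps where $G_t = G'_t$, and the numeric values are all hypothetical and chosen only for illustration.

```python
def growth_recursion_bound(same_update, eta, sigmas):
    """Iterate the Lemma 2.5 recurrence and return the bound on delta_T.

    same_update[t] is True when G_t = G'_t (both runs apply the same update)
    and False when the updates differ but are sigmas[t]-bounded.
    """
    delta = 0.0  # delta_0 = 0 because both runs start from the same point
    for same, sigma in zip(same_update, sigmas):
        if same:
            delta = eta * delta
        else:
            delta = min(eta, 1.0) * delta + 2.0 * sigma
    return delta


# Hypothetical run: 1-expansive updates that differ only at two of five steps.
flags = [True, False, True, False, True]
bound = growth_recursion_bound(flags, eta=1.0, sigmas=[0.1] * 5)
print(bound)
```

With 1-expansive updates the bound only moves at the steps where the two update sequences differ, which is the mechanism the stability proofs in Section 3 exploit.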


3 Stability of Stochastic Gradient Method 


Given $n$ labeled examples $S = (z_1, \dots, z_n)$ where $z_i \in Z$, consider a decomposable objective function

\[ f(w) = \frac{1}{n} \sum_{i=1}^{n} f(w; z_i) , \]

where $f(w; z_i)$ denotes the loss of $w$ on the example $z_i$. The stochastic gradient update for this
problem with learning rate $\alpha_t > 0$ is given by

\[ w_{t+1} = w_t - \alpha_t \nabla_w f(w_t; z_{i_t}) . \]

Stochastic gradient method (SGM) is the algorithm resulting from performing stochastic gradient
updates $T$ times where the indices $i_t$ are randomly chosen. There are two popular schemes for
choosing the examples’ indices. One is to pick $i_t$ uniformly at random in $\{1, \dots, n\}$ at each step.
The other is to choose a random permutation over $\{1, \dots, n\}$ and cycle through the examples
repeatedly in the order determined by the permutation. Our results hold for both variants.
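The two index-selection schemes can be sketched in a few lines of Python. This is an illustrative sketch, not a reference implementation: the function name `sgm`, its arguments, and the one-dimensional least-squares data are assumptions made here for the example.

```python
import random


def sgm(grad, w0, data, step_size, T, rule="uniform", seed=0):
    """Run T stochastic gradient updates w <- w - alpha_t * grad(w, z_{i_t}).

    grad(w, z) returns the gradient (a list) of the loss at parameters w
    (a list) on example z; step_size(t) returns alpha_t for t = 1, ..., T.
    """
    rng = random.Random(seed)
    n = len(data)
    perm = list(range(n))
    rng.shuffle(perm)  # fixed once; only the permutation rule reads it
    w = list(w0)
    for t in range(1, T + 1):
        if rule == "uniform":
            i = rng.randrange(n)      # fresh uniform index at every step
        elif rule == "permutation":
            i = perm[(t - 1) % n]     # cycle through one random ordering
        else:
            raise ValueError("rule must be 'uniform' or 'permutation'")
        g = grad(w, data[i])
        alpha = step_size(t)
        w = [wj - alpha * gj for wj, gj in zip(w, g)]
    return w


# Hypothetical 1-D least-squares example: loss 0.5 * (w*x - y)^2 with y = 2x.
data = [(1.0, 2.0), (-1.0, -2.0), (0.5, 1.0)]
grad = lambda w, z: [(w[0] * z[0] - z[1]) * z[0]]
w_uniform = sgm(grad, [0.0], data, lambda t: 0.5, T=200, rule="uniform")
w_perm = sgm(grad, [0.0], data, lambda t: 0.5, T=200, rule="permutation")
```

Both rules touch a single example per step; they differ only in how the index $i_t$ is drawn.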

In parallel with the previous section, the stochastic gradient method is akin to applying the
gradient update rule defined as follows.

Definition 3.1. For a nonnegative step size $\alpha \ge 0$ and a function $f\colon \Omega \to \mathbb{R}$, we define the gradient
update rule $G_{f,\alpha}$ as

\[ G_{f,\alpha}(w) = w - \alpha \nabla f(w) . \]


3.1 Proof idea: Stability of stochastic gradient method 

In order to prove that the stochastic gradient method is stable, we will analyze the output of the
algorithm on two data sets that differ in precisely one location. Note that if the loss function is
$L$-Lipschitz for every example $z$, we have $\mathbb{E}|f(w; z) - f(w'; z)| \le L\, \mathbb{E}\|w - w'\|$ for all $w$ and $w'$.
Hence, it suffices to analyze how $w_t$ and $w'_t$ diverge in the domain as a function of time $t$. Recalling
that $w_t$ is obtained from $w_{t-1}$ via a gradient update, our goal is to bound $\delta_t = \|w_t - w'_t\|$ recursively
and in expectation as a function of $\delta_{t-1}$.

There are two cases to consider. In the first case, SGM selects at step $t$ the index of an example
that is identical in $S$ and $S'$. Unfortunately, it could still be the case that $\delta_t$ grows, since $w_t$
and $w'_t$ differ and so the gradients at these two points may still differ. Below, we will show how to
control $\delta_t$ in terms of the convexity and smoothness properties of the stochastic gradients.

The second case to consider is when SGM selects the one example to update in which $S$ and
$S'$ differ. Note that this happens only with probability $1/n$ if examples are selected randomly. In
this case, we simply bound the increase in $\delta_t$ by the norms of the two gradients $\nabla f(w_{t-1}; z)$ and
$\nabla f(w'_{t-1}; z')$. The sum of the norms, scaled by the step size, is bounded by $2\alpha_t L$ and we obtain
$\delta_t \le \delta_{t-1} + 2\alpha_t L$. Combining
the two cases, we can then solve a simple recurrence relation to obtain a bound on $\delta_T$.

This simple approach suffices to obtain the desired result in the convex case, but there are
additional difficulties in the non-convex case. Here, we need to use an intriguing stability property
of the stochastic gradient method. Specifically, the first time step $t_0$ at which SGM even encounters
the example in which $S$ and $S'$ differ is a random variable in $\{1, \dots, n\}$ which tends to be relatively
large. Specifically, for any $m \in \{1, \dots, n\}$, the probability that $t_0 \le m$ is upper bounded by $m/n$.
This allows us to argue that SGM has a long “burn-in period” where $\delta_t$ does not grow at all. Once
$\delta_t$ begins to grow, the step size has already decayed, allowing us to obtain a non-trivial bound.

We now turn to making this argument precise. 
3.2 Expansion properties of stochastic gradients

Let us now record some of the core properties of the stochastic gradient update. The gradient 
update rule is bounded provided that the function / satisfies the following common Lipschitz 
condition. 

Definition 3.2. We say that $f$ is $L$-Lipschitz if for all points $u$ in the domain of $f$ we have
$\|\nabla f(u)\| \le L$. This implies that

\[ |f(u) - f(v)| \le L \|u - v\| . \tag{3.1} \]

Lemma 3.3. Assume that $f$ is $L$-Lipschitz. Then, the gradient update $G_{f,\alpha}$ is $(\alpha L)$-bounded.
Proof. By our Lipschitz assumption, $\|w - G_{f,\alpha}(w)\| = \|\alpha \nabla f(w)\| \le \alpha L$. □

We now turn to expansiveness. As we will see shortly, different expansion properties are achieved 
for non-convex, convex, and strongly convex functions. 

Definition 3.4. A function $f\colon \Omega \to \mathbb{R}$ is convex if for all $u, v \in \Omega$ we have

\[ f(u) \ge f(v) + \langle \nabla f(v), u - v \rangle . \]

Definition 3.5. A function $f\colon \Omega \to \mathbb{R}$ is $\gamma$-strongly convex if for all $u, v \in \Omega$ we have

\[ f(u) \ge f(v) + \langle \nabla f(v), u - v \rangle + \frac{\gamma}{2} \|u - v\|^2 . \]

The following standard notion of smoothness leads to a bound on how expansive the gradient
update is.

Definition 3.6. A function $f\colon \Omega \to \mathbb{R}$ is $\beta$-smooth if for all $u, v \in \Omega$ we have

\[ \|\nabla f(u) - \nabla f(v)\| \le \beta \|u - v\| . \tag{3.2} \]


In general, smoothness will imply that the gradient updates cannot be overly expansive. When
the function is also convex and the step size is sufficiently small, the gradient update becomes
non-expansive. When the function is additionally strongly convex, the gradient update becomes
contractive in the sense that $\eta$ will be less than one and $u$ and $v$ will actually shrink closer to
one another. The majority of the following results can be found in several textbooks and
monographs. Notable references are Polyak [34] and Nesterov [30]. We include proofs in the appendix
for completeness.


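Lemma 3.7. Assume that $f$ is $\beta$-smooth. Then, the following properties hold.

1. $G_{f,\alpha}$ is $(1 + \alpha\beta)$-expansive.

2. Assume in addition that $f$ is convex. Then, for any $\alpha \le 2/\beta$, the gradient update $G_{f,\alpha}$ is
1-expansive.

3. Assume in addition that $f$ is $\gamma$-strongly convex. Then, for $\alpha \le \frac{2}{\beta + \gamma}$, $G_{f,\alpha}$ is
$\big(1 - \frac{\alpha\beta\gamma}{\beta + \gamma}\big)$-expansive.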

Henceforth we will no longer mention which random selection rule we use as the proofs are 
almost identical for both rules. 
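The expansion properties of Lemma 3.7 can be probed numerically on simple quadratics. The sketch below is illustrative only: the helper names, the three test functions, and the step sizes are assumptions, and the strongly convex rate $1 - \alpha\beta\gamma/(\beta+\gamma)$ for $\alpha \le 2/(\beta+\gamma)$ is taken in its standard textbook form.

```python
import math
import random


def gradient_update(grad, alpha):
    """Return the map G(w) = w - alpha * grad(w) on list-valued points."""
    return lambda w: [wi - alpha * gi for wi, gi in zip(w, grad(w))]


def max_expansion(G, dim=2, trials=2000, seed=0):
    """Largest observed ratio ||G(u) - G(v)|| / ||u - v|| over random pairs."""
    rng = random.Random(seed)
    worst = 0.0
    for _ in range(trials):
        u = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        v = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        dist = math.dist(u, v)
        if dist > 0.0:
            worst = max(worst, math.dist(G(u), G(v)) / dist)
    return worst


beta, gamma = 2.0, 0.5
# Assumed test functions, given by their gradients (all beta-smooth):
saddle = lambda w: [-beta * w[0], beta * w[1]]   # non-convex quadratic
flat = lambda w: [0.0, beta * w[1]]              # convex, not strongly convex
bowl = lambda w: [gamma * w[0], beta * w[1]]     # gamma-strongly convex

r1 = max_expansion(gradient_update(saddle, 0.3))  # compare with 1 + alpha*beta
r2 = max_expansion(gradient_update(flat, 0.9))    # alpha <= 2/beta, compare with 1
alpha = 0.7                                       # satisfies alpha <= 2/(beta + gamma)
r3 = max_expansion(gradient_update(bowl, alpha))
rate3 = 1.0 - alpha * beta * gamma / (beta + gamma)
```

Random sampling only gives a lower estimate of the true supremum in Definition 2.3, so the observed ratios can confirm consistency with the lemma but cannot prove it.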




3.3 Convex optimization 

We begin with a simple stability bound for convex loss minimization via stochastic gradient method. 

Theorem 3.8. Assume that the loss function $f(\cdot\,; z)$ is $\beta$-smooth, convex and $L$-Lipschitz for
every $z$. Suppose that we run SGM with step sizes $\alpha_t \le 2/\beta$ for $T$ steps. Then, SGM satisfies
uniform stability with

\[ \epsilon_{\mathrm{stab}} \le \frac{2L^2}{n} \sum_{t=1}^{T} \alpha_t . \]

Proof. Let $S$ and $S'$ be two samples of size $n$ differing in only a single example. Consider the gradient
updates $G_1, \dots, G_T$ and $G'_1, \dots, G'_T$ induced by running SGM on sample $S$ and $S'$, respectively.
Let $w_T$ and $w'_T$ denote the corresponding outputs of SGM.

We now fix an example $z \in Z$ and apply the Lipschitz condition on $f(\cdot\,; z)$ to get

\[ \mathbb{E}\, |f(w_T; z) - f(w'_T; z)| \le L\, \mathbb{E}[\delta_T] , \tag{3.3} \]

where $\delta_t = \|w_t - w'_t\|$. Observe that at step $t$, with probability $1 - 1/n$, the example selected
by SGM is the same in both $S$ and $S'$. In this case we have that $G_t = G'_t$ and we can use the
1-expansivity of the update rule $G_t$ which follows from Lemma 3.7.2 using the fact that the objective
function is convex and that $\alpha_t \le 2/\beta$. With probability $1/n$ the selected example is different, in
which case we use that both $G_t$ and $G'_t$ are $\alpha_t L$-bounded as a consequence of Lemma 3.3. Hence,
we can apply Lemma 2.5 and linearity of expectation to conclude that for every $t$,

\[ \mathbb{E}[\delta_{t+1}] \le \Big(1 - \frac{1}{n}\Big) \mathbb{E}[\delta_t] + \frac{1}{n} \mathbb{E}[\delta_t] + \frac{2\alpha_t L}{n}
= \mathbb{E}[\delta_t] + \frac{2L\alpha_t}{n} . \tag{3.4} \]

Unraveling the recursion gives

\[ \mathbb{E}[\delta_T] \le \frac{2L}{n} \sum_{t=1}^{T} \alpha_t . \]

Plugging this back into equation (3.3), we obtain

\[ \mathbb{E}\, |f(w_T; z) - f(w'_T; z)| \le \frac{2L^2}{n} \sum_{t=1}^{T} \alpha_t . \]

Since this bound holds for all $S$, $S'$ and $z$, we obtain the desired bound on the uniform stability.
□
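The coupling argument in the proof can be simulated directly. The sketch below is a hedged illustration, assuming a scalar logistic loss $\log(1 + e^{-ywx})$ with $|x| \le 1$ (so that $L = 1$ and $\beta = 1/4$), synthetic data, and the uniform sampling rule; the function names and all constants are hypothetical.

```python
import math
import random


def logistic_grad(w, z):
    """Derivative in w of log(1 + exp(-y * w * x)) for a scalar model w."""
    x, y = z
    return -y * x / (1.0 + math.exp(y * w * x))


def coupled_sgm(S, S_prime, differing_index, alpha, T, seed):
    """Run SGM on S and S' with shared index choices.

    Returns (delta_T, hits), where hits counts the steps that selected the
    one example on which S and S' differ.
    """
    rng = random.Random(seed)
    w, w_prime, hits = 0.0, 0.0, 0
    for _ in range(T):
        i = rng.randrange(len(S))
        hits += (i == differing_index)
        w -= alpha * logistic_grad(w, S[i])
        w_prime -= alpha * logistic_grad(w_prime, S_prime[i])
    return abs(w - w_prime), hits


# Hypothetical data: |x| <= 1 gives L = 1 and beta = 1/4 for this loss.
data_rng = random.Random(1)
n, T, alpha, L = 50, 200, 0.5, 1.0
S = [(data_rng.uniform(-1.0, 1.0), data_rng.choice([-1.0, 1.0])) for _ in range(n)]
S_prime = list(S)
S_prime[0] = (1.0, -S[0][1])  # replace a single example

runs = [coupled_sgm(S, S_prime, 0, alpha, T, seed) for seed in range(100)]
mean_delta = sum(d for d, _ in runs) / len(runs)
theorem_bound_on_delta = 2.0 * L * alpha * T / n  # (2L/n) * sum of step sizes
```

On every run, Lemma 2.5 gives the pathwise inequality $\delta_T \le 2L\alpha \cdot \text{hits}$; averaging over the index choices, where each step hits the differing example with probability $1/n$, recovers the expectation bound used in the proof.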


3.4 Strongly Convex Optimization 

In the strongly convex case we can bound stability with no dependence on the number of steps at
all. Assume that the function $f(w; z)$ is strongly convex with respect to $w$ for all $z$. Let $\Omega$ be a
compact, convex set over which we wish to optimize. Assume further that we can readily compute
the Euclidean projection onto the set $\Omega$, namely, $\Pi_\Omega(v) = \arg\min_{w \in \Omega} \|w - v\|$. In this section we
restrict our attention to the projected stochastic gradient method

\[ w_{t+1} = \Pi_\Omega\big( w_t - \alpha_t \nabla f(w_t; z_t) \big) . \tag{3.5} \]

A common application of the above iteration in machine learning is solving Tikhonov regularization
problems. Specifically, the empirical risk is augmented with an additional regularization
term,

\[ \operatorname{minimize}_{w} \quad R_{S,\mu}[w] := \frac{1}{n} \sum_{i=1}^{n} f(w; z_i) + \frac{\mu}{2} \|w\|_2^2 , \tag{3.6} \]

where $f$ is as before a pre-specified loss function. We can assume without loss of generality that
$f(0; \cdot) = 1$. Then, the optimal solution of (3.6) must lie in a ball of radius $r$ about 0 where
$r = \sqrt{2/\mu}$. This fact can be ascertained by plugging in $w = 0$ and noting that the minimizer of
(3.6) must have a smaller cost, thus $\frac{\mu}{2} \|w^*\|^2 \le R_{S,\mu}[w^*] \le R_{S,\mu}[0] = 1$. We can now define the set
$\Omega$ to be the ball of radius $r$, in which case the projection is a simple scaling operation. Throughout
the rest of the section we replace $f(w; z)$ with its regularized form, namely,

\[ f(w; z) \mapsto f(w; z) + \frac{\mu}{2} \|w\|_2^2 , \]

which is strongly convex with parameter $\mu$. Similarly, we will overload the constant $L$ by
setting

\[ L = \sup_{w \in \Omega} \sup_{z} \|\nabla f(w; z)\|_2 . \tag{3.7} \]

Note that if $f(w; z)$ is $\beta$-smooth for all $z$, then $L$ is always finite as it is less than or equal to
$\beta\, \mathrm{diam}(\Omega)$. We need to restrict the supremum to $w \in \Omega$ because strongly convex functions have
unbounded gradients on $\mathbb{R}^n$. We can now state the first result about strongly convex functions.


Theorem 3.9. Assume that the loss function $f(\cdot\,; z)$ is $\gamma$-strongly convex and $\beta$-smooth for all $z$.
Suppose we run the projected SGM iteration (3.5) with constant step size $\alpha \le 1/\beta$ for $T$ steps.
Then, SGM satisfies uniform stability with

\[ \epsilon_{\mathrm{stab}} \le \frac{2L^2}{\gamma n} . \]

Proof. The proof is analogous to that of Theorem 3.8 with a slightly different recurrence relation.
We repeat the argument for completeness. Let $S$ and $S'$ be two samples of size $n$ differing in only
a single example. Consider the gradient updates $G_1, \dots, G_T$ and $G'_1, \dots, G'_T$ induced by running
SGM on sample $S$ and $S'$, respectively. Let $w_T$ and $w'_T$ denote the corresponding outputs of SGM.
Denoting $\delta_t = \|w_t - w'_t\|$ and appealing to the boundedness of the gradient of $f$, we have

\[ \mathbb{E}\, |f(w_T; z) - f(w'_T; z)| \le L\, \mathbb{E}[\delta_T] . \tag{3.8} \]

Observe that at step $t$, with probability $1 - 1/n$, the example selected by SGM is the same in both
$S$ and $S'$. In this case we have that $G_t = G'_t$. At this stage, note that

\[ \delta_{t+1} \le \| w_t - \alpha \nabla f(w_t; z_t) - w'_t + \alpha \nabla f(w'_t; z_t) \| \]

because Euclidean projection does not increase the distance between projected points (see Lemma 4.6
below for a generalization of this fact). We can now apply the following useful simplification of
Lemma 3.7.3: if $\alpha \le 1/\beta$, then $\alpha\gamma \le 1$ and $G_{f,\alpha}$ is $(1 - \alpha\gamma)$-expansive. With probability
$1/n$ the selected example is different, in which case we use that both $G_t$ and $G'_t$ are $\alpha L$-bounded
as a consequence of Lemma 3.3. Hence, we can apply Lemma 2.5 and linearity of expectation to
conclude that for every $t$,

\begin{align*}
\mathbb{E}\, \delta_{t+1} &\le \Big(1 - \frac{1}{n}\Big)(1 - \alpha\gamma)\, \mathbb{E}\, \delta_t + \frac{1}{n}(1 - \alpha\gamma)\, \mathbb{E}\, \delta_t + \frac{2\alpha L}{n} \tag{3.9} \\
&= (1 - \alpha\gamma)\, \mathbb{E}\, \delta_t + \frac{2\alpha L}{n} .
\end{align*}

Unraveling the recursion gives

\[ \mathbb{E}\, \delta_T \le \frac{2L\alpha}{n} \sum_{t=0}^{T} (1 - \alpha\gamma)^t \le \frac{2L}{\gamma n} . \]

Plugging the above inequality into equation (3.8), we obtain

\[ \mathbb{E}\, |f(w_T; z) - f(w'_T; z)| \le \frac{2L^2}{\gamma n} . \]

Since this bound holds for all $S$, $S'$ and $z$, the theorem follows.
□
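The reason the strongly convex bound does not grow with $T$ is that the recursion (3.9) has a fixed point. A small Python sketch, with hypothetical constants and an assumed function name, iterates the recursion and compares it with the limit $2L/(\gamma n)$.

```python
def strongly_convex_recursion(alpha, gamma, L, n, T):
    """Iterate x_{t+1} = (1 - alpha*gamma) * x_t + 2*alpha*L/n from x_0 = 0."""
    x, trajectory = 0.0, [0.0]
    for _ in range(T):
        x = (1.0 - alpha * gamma) * x + 2.0 * alpha * L / n
        trajectory.append(x)
    return trajectory


# Hypothetical constants; alpha <= 1/beta is assumed to hold.
alpha, gamma, L, n = 0.1, 1.0, 1.0, 100
trajectory = strongly_convex_recursion(alpha, gamma, L, n, T=1000)
limit = 2.0 * L / (gamma * n)  # the T-independent bound on E[delta_T]
```

The iterates increase towards `limit` and never exceed it (up to floating-point rounding), so running more steps does not loosen the bound.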


We would like to note that a nearly identical result holds for a “staircase” decaying step-size 
that is also popular in machine learning and stochastic optimization. 

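Theorem 3.10. Assume that the loss function $f(\cdot\,; z) \in [0, 1]$ is $\gamma$-strongly convex, has gradients
bounded by $L$ as in (3.7), and is $\beta$-smooth for all $z$. Suppose we run SGM with step sizes
$\alpha_t = \frac{1}{\gamma t}$. Then, SGM has uniform stability of

\[ \epsilon_{\mathrm{stab}} \le \frac{2L^2 + \beta\rho}{\gamma n} , \]

where $\rho = \sup_{w \in \Omega} \sup_{z} f(w; z)$.

Proof. Note that once $t > \beta/\gamma$, the iterates are contractive with contractivity $1 - \alpha_t \gamma \le 1 - \frac{1}{t}$. Thus,
for $t \ge t_0 := \beta/\gamma$, we have

\begin{align*}
\mathbb{E}[\delta_{t+1}] &\le \Big(1 - \frac{1}{n}\Big)(1 - \alpha_t \gamma)\, \mathbb{E}[\delta_t] + \frac{1}{n}\big( (1 - \alpha_t \gamma)\, \mathbb{E}[\delta_t] + 2\alpha_t L \big) \\
&= (1 - \alpha_t \gamma)\, \mathbb{E}[\delta_t] + \frac{2\alpha_t L}{n} \\
&= \Big(1 - \frac{1}{t}\Big) \mathbb{E}[\delta_t] + \frac{2L}{\gamma t n} .
\end{align*}

Assuming that $\delta_{t_0} = 0$ and expanding this recursion, we find:

\[ \mathbb{E}[\delta_T] \le \sum_{t=t_0}^{T} \Big\{ \prod_{s=t+1}^{T} \Big(1 - \frac{1}{s}\Big) \Big\} \frac{2L}{\gamma t n}
= \sum_{t=t_0}^{T} \frac{t}{T} \cdot \frac{2L}{\gamma t n}
= \frac{T - t_0 + 1}{T} \cdot \frac{2L}{\gamma n} . \]

Now, the result follows from Lemma 3.11 with the fact that $t_0 = \beta/\gamma$.
□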
3.5 Non-convex optimization

In this section we prove stability results for stochastic gradient methods that do not require
convexity. We will still assume that the objective function is smooth and Lipschitz as defined previously.

The crux of the proof is to observe that SGM typically makes several steps before it even 
encounters the one example on which two data sets in the stability analysis differ. 

Lemma 3.11. Assume that the loss function $f(\cdot\,; z)$ is nonnegative and $L$-Lipschitz for all $z$. Let $S$
and $S'$ be two samples of size $n$ differing in only a single example. Denote by $w_T$ and $w'_T$ the output
of $T$ steps of SGM on $S$ and $S'$, respectively. Then, for every $z \in Z$ and every $t_0 \in \{0, 1, \dots, n\}$,
under both the random update rule and the random permutation rule, we have

\[ \mathbb{E}\, |f(w_T; z) - f(w'_T; z)| \le \frac{t_0}{n} \sup_{w, z} f(w; z) + L\, \mathbb{E}[\delta_T \mid \delta_{t_0} = 0] . \]

Proof. Let $S$ and $S'$ be two samples of size $n$ differing in only a single example, and let $z \in Z$ be
an arbitrary example. Consider running SGM on sample $S$ and $S'$, respectively. As stated, $w_T$ and
$w'_T$ denote the corresponding outputs of SGM. Let $\mathcal{E} = \mathbf{1}[\delta_{t_0} = 0]$ denote the event that $\delta_{t_0} = 0$.
We have

\begin{align*}
\mathbb{E}\, |f(w_T; z) - f(w'_T; z)|
&= \mathbb{P}\{\mathcal{E}\}\, \mathbb{E}\big[ |f(w_T; z) - f(w'_T; z)| \mid \mathcal{E} \big]
 + \mathbb{P}\{\mathcal{E}^c\}\, \mathbb{E}\big[ |f(w_T; z) - f(w'_T; z)| \mid \mathcal{E}^c \big] \\
&\le \mathbb{E}\big[ |f(w_T; z) - f(w'_T; z)| \mid \mathcal{E} \big] + \mathbb{P}\{\mathcal{E}^c\} \cdot \sup_{w, z} f(w; z) \\
&\le L\, \mathbb{E}\big[ \|w_T - w'_T\| \mid \mathcal{E} \big] + \mathbb{P}\{\mathcal{E}^c\} \cdot \sup_{w, z} f(w; z) .
\end{align*}

The second inequality follows from the Lipschitz assumption.

It remains to bound $\mathbb{P}\{\mathcal{E}^c\}$. Toward that end, let $i^* \in \{1, \dots, n\}$ denote the position in which
$S$ and $S'$ differ and consider the random variable $I$ assuming the index of the first time step in
which SGM uses the example $z_{i^*}$. Note that when $I > t_0$, then we must have that $\delta_{t_0} = 0$, since
the execution on $S$ and $S'$ is identical until step $t_0$. Hence,

\[ \mathbb{P}\{\mathcal{E}^c\} = \mathbb{P}\{\delta_{t_0} \ne 0\} \le \mathbb{P}\{I \le t_0\} . \]

Under the random permutation rule, $I$ is a uniformly random number in $\{1, \dots, n\}$ and therefore

\[ \mathbb{P}\{I \le t_0\} = \frac{t_0}{n} . \]

This proves the claim we stated for the random permutation rule. For the random selection rule,
we have by the union bound $\mathbb{P}\{I \le t_0\} \le \sum_{t=1}^{t_0} \mathbb{P}\{i_t = i^*\} = \frac{t_0}{n}$. This completes the proof. □
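The probability that SGM has touched the differing example within the first $t_0$ steps can be computed in closed form for both sampling rules. The sketch below uses assumed function names and a hypothetical sample size; for the uniform rule it evaluates $1 - (1 - 1/n)^{t_0}$, which the union bound upper-bounds by $t_0/n$.

```python
def hit_probability_uniform(n, t0):
    """P{I <= t0} when each step draws an index uniformly from n examples."""
    return 1.0 - (1.0 - 1.0 / n) ** t0


def hit_probability_permutation(n, t0):
    """P{I <= t0} when the first n steps follow a random permutation."""
    return min(t0, n) / n


n = 1000
table = [
    (t0, hit_probability_uniform(n, t0), hit_probability_permutation(n, t0), t0 / n)
    for t0 in (0, 10, 100, 500, 1000)
]
for row in table:
    print(row)
```

For small $t_0/n$ the two rules nearly coincide, which is why a long burn-in period without any divergence is likely under either rule.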

Theorem 3.12. Assume that ff',z) G [0,1] is an L-Lipschitz and f-smooth loss function for 
every z. Suppose that we run SGM for T steps with monotonically non-increasing step sizes at < c/t. 
Then, SGM has uniform stability with 


Cstab ^ 


1 l/fc 
n — 1 


/3c 


(2cL 


12 



In particular, omitting constant factors that depend on j3, c, and L, we get 


f^stab 


< 


n 


Proof. Let S and S' be two samples of size n differing in only a single example. Consider the gradient 
updates Gi,, Gt and G'^,..., G'j. induced by running SGM on sample S and S', respectively. 
Let wt and w'rp denote the corresponding outputs of SGM. 


By Lemma 3.11, we have for every to ^ {Ij • • ■ j 


E \f{wT\z) - f{w'T;z)\ < ^ + LE [5t I = 0] , 


(3.10) 


where 6t = ||rct — To simplify notation, let Aj = E [6t \ St^ = 0]. We will bound as function 
of to and then minimize for to- 

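Toward this goal, observe that at step t, with probability 1 − 1/n, the example selected by SGM
is the same in both S and S'. In this case we have that $G_t = G'_t$ and we can use the $(1 + \alpha_t\beta)$-expansivity
of the update rule $G_t$, which follows from our smoothness assumption via Lemma 3.7.1.
With probability 1/n the selected example is different, in which case we use that both $G_t$ and
$G'_t$ are $\alpha_t L$-bounded as a consequence of Lemma 3.3.

Hence, we can apply Lemma 2.5 and linearity of expectation to conclude that for every $t \ge t_0$,

$$\begin{aligned}
\Delta_{t+1} &\le \left(1 - \tfrac{1}{n}\right)(1 + \alpha_t\beta)\,\Delta_t + \tfrac{1}{n}\,\Delta_t + \frac{2\alpha_t L}{n} \\
&\le \left(\tfrac{1}{n} + \left(1 - \tfrac{1}{n}\right)\left(1 + \tfrac{c\beta}{t}\right)\right)\Delta_t + \frac{2cL}{tn} \\
&= \left(1 + \left(1 - \tfrac{1}{n}\right)\tfrac{c\beta}{t}\right)\Delta_t + \frac{2cL}{tn} \\
&\le \exp\!\left(\left(1 - \tfrac{1}{n}\right)\tfrac{c\beta}{t}\right)\Delta_t + \frac{2cL}{tn} .
\end{aligned}$$

Here we used that $1 + x \le \exp(x)$ for all x.

Using the fact that $\Delta_{t_0} = 0$, we can unwind this recurrence relation from T down to $t_0 + 1$.
This gives

$$\begin{aligned}
\Delta_T &\le \sum_{t=t_0+1}^{T} \left\{ \prod_{k=t+1}^{T} \exp\!\left(\left(1 - \tfrac{1}{n}\right)\tfrac{\beta c}{k}\right) \right\} \frac{2cL}{tn} \\
&= \sum_{t=t_0+1}^{T} \exp\!\left(\left(1 - \tfrac{1}{n}\right)\beta c \sum_{k=t+1}^{T} \tfrac{1}{k}\right) \frac{2cL}{tn} \\
&\le \sum_{t=t_0+1}^{T} \exp\!\left(\left(1 - \tfrac{1}{n}\right)\beta c \log\!\left(\tfrac{T}{t}\right)\right) \frac{2cL}{tn} \\
&= \frac{2cL}{n}\, T^{\beta c (1 - 1/n)} \sum_{t=t_0+1}^{T} t^{-\beta c (1 - 1/n) - 1} \\
&\le \frac{1}{(1 - 1/n)\beta c} \cdot \frac{2cL}{n} \left(\frac{T}{t_0}\right)^{\beta c (1 - 1/n)} \\
&\le \frac{2L}{\beta (n - 1)} \left(\frac{T}{t_0}\right)^{\beta c} .
\end{aligned}$$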


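Plugging this bound into (3.10), we get

$$\mathbb{E}\,|f(w_T; z) - f(w'_T; z)| \le \frac{t_0}{n} + \frac{2L^2}{\beta(n - 1)} \left(\frac{T}{t_0}\right)^{\beta c} .$$

Letting q = βc, the right hand side is approximately minimized when

$$t_0 = (2cL^2)^{\frac{1}{q+1}} \, T^{\frac{q}{q+1}} .$$

This setting gives us

$$\mathbb{E}\,|f(w_T; z) - f(w'_T; z)| \le \frac{1 + 1/q}{n - 1} \, (2cL^2)^{\frac{1}{q+1}} \, T^{\frac{q}{q+1}} = \frac{1 + 1/(\beta c)}{n - 1} \, (2cL^2)^{\frac{1}{\beta c + 1}} \, T^{\frac{\beta c}{\beta c + 1}} .$$

Since the bound we just derived holds for all S, S' and z, we immediately get the claimed upper
bound on the uniform stability. □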
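
The final step of the proof can be checked numerically. The sketch below evaluates the quantity $t_0/n + \frac{2L^2}{\beta(n-1)}(T/t_0)^{\beta c}$ at the proposed $t_0$ and compares it with the bound stated in Theorem 3.12. The function names and the constants are hypothetical, chosen only for illustration, and $t_0$ is treated as a real number rather than rounded to an integer.

```python
import math

def stability_bound(n, T, beta, c, L):
    """Right-hand side of the uniform stability bound in Theorem 3.12."""
    q = beta * c
    return (1 + 1 / q) / (n - 1) * (2 * c * L**2) ** (1 / (q + 1)) * T ** (q / (q + 1))

def burn_in_tradeoff(t0, n, T, beta, c, L):
    """t0/n + 2 L^2 / (beta (n - 1)) * (T / t0)^(beta c): the quantity minimized over t0."""
    return t0 / n + 2 * L**2 / (beta * (n - 1)) * (T / t0) ** (beta * c)

def approx_best_t0(T, beta, c, L):
    """The approximately minimizing t0 = (2 c L^2)^(1/(q+1)) * T^(q/(q+1)), with q = beta c."""
    q = beta * c
    return (2 * c * L**2) ** (1 / (q + 1)) * T ** (q / (q + 1))

# Hypothetical constants, for illustration only.
n, T, beta, c, L = 50_000, 100_000, 1.0, 0.5, 1.0
t0 = approx_best_t0(T, beta, c, L)
at_t0 = burn_in_tradeoff(t0, n, T, beta, c, L)
bound = stability_bound(n, T, beta, c, L)
print(t0, at_t0, bound)
```

With these values the trade-off evaluated at $t_0$ sits just below the stated bound, and moving $t_0$ by a factor of two in either direction makes it larger.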


4 Stability-inducing operations 

In light of our results, it makes sense to look for operations that increase the stability of the
stochastic gradient method. We show in this section that, pleasingly, several popular heuristics and
methods do improve the stability of SGM. Our rather straightforward analyses both strengthen
the bounds we previously obtained and help to provide an explanation for the empirical success of
these methods.


Weight Decay and Regularization. Weight decay is a simple and effective method that often
improves generalization [20].

Definition 4.1. Let $f: \Omega \to \mathbb{R}$ be a differentiable function. We define the gradient update with
weight decay at rate μ as $G_{f,\mu,\alpha}(w) = (1 - \alpha\mu)w - \alpha\nabla f(w)$.

It is easy to verify that the above update rule is equivalent to performing a gradient update on
the $\ell_2$-regularized objective $g(w) = f(w) + \frac{\mu}{2}\|w\|^2$.

Lemma 4.2. Assume that f is β-smooth. Then, $G_{f,\mu,\alpha}$ is $(1 + \alpha(\beta - \mu))$-expansive.

Proof. Let $G = G_{f,\mu,\alpha}$. By triangle inequality and our smoothness assumption,

$$\begin{aligned}
\|G(v) - G(w)\| &\le (1 - \alpha\mu)\|v - w\| + \alpha\|\nabla f(w) - \nabla f(v)\| \\
&\le (1 - \alpha\mu)\|v - w\| + \alpha\beta\|w - v\| \\
&= (1 - \alpha\mu + \alpha\beta)\|v - w\| .
\end{aligned}$$

□

The above lemma shows that a regularization parameter μ counters a smoothness parameter β.
Once μ > β, the gradient update with decay becomes contractive. Any theorem we proved
in previous sections that has a dependence on β leads to a corresponding theorem for stochastic
gradient with weight decay in which β is replaced with β − μ.
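
As a sanity check of Lemma 4.2, the following sketch applies the weight-decay update to random pairs of points and compares the observed expansion with $1 + \alpha(\beta - \mu)$. The objective $f(w) = \sum_i \log\cosh(w_i)$ is an assumption made for illustration (its gradient, tanh, is 1-Lipschitz, so β = 1); the helper names and constants are likewise hypothetical.

```python
import math
import random

def weight_decay_update(w, grad, alpha, mu):
    """G_{f,mu,alpha}(w) = (1 - alpha*mu) * w - alpha * grad f(w)."""
    g = grad(w)
    return [(1 - alpha * mu) * wi - alpha * gi for wi, gi in zip(w, g)]

def dist(u, v):
    """Euclidean distance between two vectors given as lists."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def grad_logcosh(w):
    """Gradient of the toy objective f(w) = sum_i log(cosh(w_i)); beta = 1."""
    return [math.tanh(wi) for wi in w]

beta, mu, alpha = 1.0, 0.3, 0.1           # illustrative constants
factor = 1 + alpha * (beta - mu)          # expansivity constant from Lemma 4.2

rng = random.Random(0)
worst_ratio = 0.0
for _ in range(1000):
    v = [rng.uniform(-3, 3) for _ in range(5)]
    w = [rng.uniform(-3, 3) for _ in range(5)]
    gv = weight_decay_update(v, grad_logcosh, alpha, mu)
    gw = weight_decay_update(w, grad_logcosh, alpha, mu)
    worst_ratio = max(worst_ratio, dist(gv, gw) / dist(v, w))

print(worst_ratio, factor)
```

Because this toy objective is also convex, the observed ratio stays below 1, which is stronger than what the lemma guarantees for a general β-smooth function.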


Gradient Clipping. It is common when training deep neural networks to enforce bounds on 
the norm of the gradients encountered by SGD. This is often done by either truncation, scaling, 
or dropping of examples that cause an exceptionally large value of the gradient norm. Any such 
heuristic directly leads to a bound on the Lipschitz parameter L that appears in our bounds. It is
also easy to introduce a varying Lipschitz parameter $L_t$ to account for possibly different values.
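
A minimal sketch of the scaling variant is given below: the gradient is rescaled whenever its norm exceeds a threshold, so every update moves the iterate by at most the step size times that threshold. The names `clip_by_norm`, `clipped_sgd_step`, and `max_norm` are hypothetical, and `max_norm` plays the role of L in the bounds.

```python
import math

def clip_by_norm(grad, max_norm):
    """Rescale grad so that its Euclidean norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)
    scale = max_norm / norm
    return [g * scale for g in grad]

def clipped_sgd_step(w, grad, alpha, max_norm):
    """One gradient step using the clipped gradient."""
    clipped = clip_by_norm(grad, max_norm)
    return [wi - alpha * gi for wi, gi in zip(w, clipped)]

w = [1.0, -2.0]
g = [30.0, 40.0]                       # a gradient of norm 50
w_next = clipped_sgd_step(w, g, alpha=0.1, max_norm=5.0)
step_length = math.sqrt(sum((a - b) ** 2 for a, b in zip(w_next, w)))
print(step_length)                     # at most alpha * max_norm
```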


Dropout. Dropout is a popular and effective heuristic for preventing large neural networks
from overfitting. Here we prove that, indeed, dropout improves all of our stability bounds generically.
From the point of view of stochastic gradient descent, dropout is equivalent to setting a
fraction of the gradient weights to zero. That is, instead of updating with a stochastic gradient
$\nabla f(w; z)$ we instead update with a perturbed gradient $D\nabla f(w; z)$, which is typically identical
to $\nabla f(w; z)$ in some of the coordinates and equal to 0 on the remaining coordinates, although our
definition is a fair bit more general.

Definition 4.3. We say that a randomized map $D: \Omega \to \Omega$ is a dropout operator with dropout
rate s if for every $v \in \Omega$ we have $\mathbb{E}\|Dv\| = s\|v\|$. For a differentiable function $f: \Omega \to \mathbb{R}$, we let
$DG_{f,\alpha}$ denote the dropout gradient update defined as $DG_{f,\alpha}(v) = v - \alpha D(\nabla f(v))$.

As expected, dropout improves the effective Lipschitz constant of the objective function.

Lemma 4.4. Assume that f is L-Lipschitz. Then, the dropout update $DG_{f,\alpha}$ with dropout rate s
is (sαL)-bounded.

Proof. By our Lipschitz assumption and linearity of expectation,

$$\mathbb{E}\|DG_{f,\alpha}(v) - v\| = \alpha\,\mathbb{E}\|D\nabla f(v)\| = \alpha s\,\mathbb{E}\|\nabla f(v)\| \le \alpha s L .$$

□


From this lemma we can obtain various corollaries by replacing L with sL in our theorems.

Projections and Proximal Steps. Related to regularization, there are many popular updates
which follow a stochastic gradient update with a projection onto a set or some statistical shrinkage
operation. The vast majority of these operations can be understood as applying a proximal-point
operation associated with a convex function. Similar to the gradient operation, we can define the
proximal update rule.

Definition 4.5. For a nonnegative step size α ≥ 0 and a function $f: \Omega \to \mathbb{R}$, we define the
proximal update rule $P_{f,\alpha}$ as

$$P_{f,\alpha}(w) = \arg\min_v \; \tfrac{1}{2}\|w - v\|^2 + \alpha f(v) . \qquad (4.1)$$

For example, Euclidean projection is the proximal point operation associated with the indicator
of the associated set. Soft-thresholding is the proximal point operator associated with the $\ell_1$-norm.
For more information, see the surveys by Combettes and Wajs or Parikh and Boyd [33].

An elementary proof of the following lemma, due to Rockafellar [36], can be found in the
appendix.

Lemma 4.6. If f is convex, the proximal update (4.1) is 1-expansive.

In particular, this lemma implies that the Euclidean projection onto a convex set is 1-expansive.
Note that in many important cases, proximal operators are actually contractive. That is, they are
η-expansive with η < 1. A notable example is when f(·) is the Euclidean norm for which the
update rule is η-expansive with $\eta = (1 + \alpha)^{-1}$. So stability can be induced by the choice of an
appropriate prox-operation, which can always be interpreted as some form of regularization.
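
The two examples named above, soft-thresholding and Euclidean projection, can be written in a few lines, and their 1-expansivity can be checked empirically on random pairs of points. This is an illustrative sketch; the function names, the threshold 0.5, and the unit-ball constraint set are assumptions made for the example.

```python
import math
import random

def soft_threshold(w, alpha):
    """Proximal update for f(v) = ||v||_1 with step size alpha (coordinate-wise shrinkage)."""
    return [math.copysign(max(abs(wi) - alpha, 0.0), wi) for wi in w]

def project_onto_ball(w, radius):
    """Euclidean projection onto {v : ||v|| <= radius}, the prox of that set's indicator."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    if norm <= radius:
        return list(w)
    return [wi * radius / norm for wi in w]

def dist(u, v):
    """Euclidean distance between two vectors given as lists."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

prox_maps = (
    lambda x: soft_threshold(x, 0.5),
    lambda x: project_onto_ball(x, 1.0),
)

rng = random.Random(0)
worst_ratio = 0.0
for _ in range(1000):
    v = [rng.uniform(-2, 2) for _ in range(4)]
    w = [rng.uniform(-2, 2) for _ in range(4)]
    for prox in prox_maps:
        worst_ratio = max(worst_ratio, dist(prox(v), prox(w)) / dist(v, w))

print(worst_ratio)   # never exceeds 1, up to floating-point rounding
```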


Model Averaging. Model averaging refers to the idea of averaging out the iterates $w_t$ obtained
by a run of SGD. In convex optimization, model averaging is sometimes observed to lead to better
empirical performance of SGM and closely related updates such as the Perceptron [10]. Here we
show that model averaging improves our bound for convex optimization by a constant factor.


Theorem 4.7. Assume that $f: \Omega \to [0, 1]$ is a decomposable convex L-Lipschitz β-smooth function
and that we run SGD with step sizes $\alpha_t \le \alpha \le 2/\beta$ for T steps. Then, the average of the first T
iterates of SGD has uniform stability of $\epsilon_{\mathrm{stab}} \le \frac{\alpha L^2 (T + 1)}{n}$.

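Proof. Let $\bar{w}_T = \frac{1}{T}\sum_{t=1}^{T} w_t$ denote the average of the stochastic gradient iterates. Since

$$w_t = \sum_{k=1}^{t} \alpha \nabla f(w_k; (x_k, y_k)) ,$$

we have

$$\bar{w}_T = \alpha \sum_{t=1}^{T} \frac{T - t + 1}{T} \, \nabla f(w_t; (x_t, y_t)) .$$

Using Lemma 3.8, the deviation $\delta_t$ between $\bar{w}_t$ and $\bar{w}'_t$ obeys

$$\delta_t \le (1 - 1/n)\,\delta_{t-1} + \frac{1}{n}\left(\delta_{t-1} + 2\alpha L \, \frac{T - t + 1}{T}\right) ,$$

which implies

$$\delta_T \le \frac{2\alpha L}{n} \sum_{t=1}^{T} \frac{T - t + 1}{T} = \frac{\alpha L (T + 1)}{n} .$$

Since f is L-Lipschitz, we have

$$\mathbb{E}\,|f(\bar{w}_T) - f(\bar{w}'_T)| \le L\,\|\bar{w}_T - \bar{w}'_T\| \le \frac{\alpha L^2 (T + 1)}{n} .$$

Here the expectation is taken over the algorithm and hence the claim follows by our definition of
uniform stability. □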
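
The closed form for the sum of the weights $(T - t + 1)/T$ that appears in the bound on $\delta_T$ is an arithmetic series. Spelled out, with the substitution $j = T - t + 1$:

```latex
\sum_{t=1}^{T} \frac{T - t + 1}{T}
  = \frac{1}{T} \sum_{j=1}^{T} j
  = \frac{1}{T} \cdot \frac{T(T+1)}{2}
  = \frac{T+1}{2},
\qquad\text{so}\qquad
\frac{2\alpha L}{n} \cdot \frac{T+1}{2} = \frac{\alpha L (T+1)}{n}.
```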


5 Convex risk minimization 


We now outline how our generalization bounds lead to bounds on the population risk achieved by 
SGM in the convex setting. We restrict our attention to the convex case where we can contrast 
against known results. The main feature of our results is that one can achieve bounds
comparable to, or perhaps better than, known results on stochastic gradient for risk minimization by
running for multiple passes over the data set.

The key to the analysis in this section is to decompose the risk estimates into an optimization
error term and a stability term. The optimization error designates how closely we optimize the
empirical risk or a proxy of the empirical risk. By optimizing with stochastic gradient, we will
be able to balance this optimization accuracy against how well we generalize. These results are
inspired by the work of Bousquet and Bottou, who provided similar analyses for SGM based on
uniform convergence. However, our stability results will yield sharper bounds.

Throughout this section, our risk decomposition works as follows. We define the optimization
error to be the gap between the empirical risk and minimum empirical risk in expectation:

$$\epsilon_{\mathrm{opt}}(w) := \mathbb{E}\big[ R_S[w] - R_S[w_\star^S] \big] , \quad \text{where } w_\star^S = \arg\min_w R_S[w] .$$

By Theorem 2.2, the expected risk of a w output by SGM is bounded as

$$\mathbb{E}[R[w]] \le \mathbb{E}[R_S[w]] + \epsilon_{\mathrm{stab}} \le \mathbb{E}[R_S[w_\star^S]] + \epsilon_{\mathrm{opt}}(w) + \epsilon_{\mathrm{stab}} .$$

In general, the optimization error decreases with the number of SGM iterations while the stability
term increases. Balancing these two terms will thus provide a reasonable excess risk against the empirical
risk minimizer. Note that our analysis involves the expected minimum empirical risk which could
be considerably smaller than the minimum risk. However, as we now show, it can never be larger.

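Lemma 5.1. Let $w_\star$ denote the minimizer of the population risk and $w_\star^S$ denote the minimizer of
the empirical risk given a sampled data set S. Then $\mathbb{E}[R_S[w_\star^S]] \le R[w_\star]$.

Proof.

$$\begin{aligned}
R[w_\star] = \inf_w R[w] &= \inf_w \mathbb{E}_z[f(w; z)] \\
&= \inf_w \mathbb{E}_S\Big[ \frac{1}{n}\sum_{i=1}^{n} f(w; z_i) \Big] \\
&\ge \mathbb{E}_S\Big[ \inf_w \frac{1}{n}\sum_{i=1}^{n} f(w; z_i) \Big] \\
&= \mathbb{E}_S\Big[ \frac{1}{n}\sum_{i=1}^{n} f(w_\star^S; z_i) \Big] \\
&= \mathbb{E}[R_S[w_\star^S]] .
\end{aligned}$$

□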


To analyze the optimization error, we will make use of a classical result due to Nemirovski and
Yudin [29].

Theorem 5.2. Assume we run stochastic gradient descent with constant stepsize α on a convex
function

$$R[w] = \mathbb{E}_z[f(w; z)] .$$

Assume further that $\|\nabla f(w; z)\| \le L$ and $\|w_0 - w_\star\| \le D$ for some minimizer $w_\star$ of R. Let $\bar{w}_T$
denote the average of the T iterates of the algorithm. Then we have

$$R[\bar{w}_T] \le R[w_\star] + \frac{1}{2}\frac{D^2}{T\alpha} + \frac{1}{2} L^2 \alpha .$$

The upper bound stated in the previous theorem is known to be tight even if the function is
β-smooth [29].

If we plug in the population risk for R in the previous theorem, we directly obtain a generalization
bound for SGM that holds when we make a single pass over the data. The theorem requires fresh 
samples from the distribution in each update step of SGM. Hence, given n data points, we cannot 
make more than n steps, and each sample must not be used more than once. 


Corollary 5.3. Let f be a convex loss function satisfying $\|\nabla f(w, z)\| \le L$ and let $w_\star$ be a minimizer
of the population risk $R[w] = \mathbb{E}_z f(w; z)$. Suppose we make a single pass of SGM over the sample
$S = (z_1, \dots, z_n)$ with a suitably chosen fixed step size starting from a point $w_0$ that satisfies
$\|w_0 - w_\star\| \le D$. Then, the average $\bar{w}_n$ of the iterates satisfies

$$\mathbb{E}[R[\bar{w}_n]] \le R[w_\star] + \frac{DL}{\sqrt{n}} . \qquad (5.1)$$


We now contrast this bound with what follows from our results. 

Proposition 5.4. Let $S = (z_1, \dots, z_n)$ be a sample of size n. Let f be a β-smooth convex loss
function satisfying $\|\nabla f(w, z)\| \le L$ and let $w_\star^S$ be a minimizer of the empirical risk
$R_S[w] = \frac{1}{n}\sum_{i=1}^{n} f(w; z_i)$. Suppose we run T steps of SGM with suitably chosen step size from a starting
point $w_0$ that satisfies $\|w_0 - w_\star^S\| \le D$. Then, the average $\bar{w}_T$ over the iterates satisfies

$$\mathbb{E}[R[\bar{w}_T]] \le \mathbb{E}[R_S[w_\star^S]] + \frac{DL}{\sqrt{n}} \sqrt{\frac{n + 2T}{T}} .$$

Proof. On the one hand, applying Theorem 5.2 to the empirical risk $R_S$, we get

$$\epsilon_{\mathrm{opt}}(\bar{w}_T) \le \frac{1}{2}\frac{D^2}{T\alpha} + \frac{1}{2} L^2 \alpha .$$

Here, $w_\star^S$ is an empirical risk minimizer. On the other hand, by our stability bound for the
convex case,

$$\epsilon_{\mathrm{stab}} \le \frac{T L^2 \alpha}{n} .$$

Combining these two inequalities we have

$$\mathbb{E}[R[\bar{w}_T]] \le \mathbb{E}[R_S[w_\star^S]] + \frac{1}{2}\frac{D^2}{T\alpha} + \frac{1}{2} L^2 \left(1 + \frac{2T}{n}\right) \alpha .$$

Choosing α to be

$$\alpha = \frac{D\sqrt{n}}{L\sqrt{T(n + 2T)}}$$

yields the bound provided in the proposition. □


Note that the bound from our stability analysis is not directly comparable to Corollary 5.3
as we are comparing against the expected minimum empirical risk rather than the minimum risk.
Lemma 5.1 implies that the excess risk in our bound is at most worse by a factor of $\sqrt{3}$ compared
with Corollary 5.3 when T = n. Moreover, the excess risk in our bound tends to a factor merely
$\sqrt{2}$ larger than the Nemirovski-Yudin bound as T goes to infinity. In contrast, the classical bound
does not apply when T > n.
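
The factors $\sqrt{3}$ and $\sqrt{2}$ can be reproduced directly from the quantities in the proof of Proposition 5.4. The sketch below adds the optimization term from Theorem 5.2 to the stability term $TL^2\alpha/n$ at the chosen step size and divides by the single-pass excess risk $DL/\sqrt{n}$ of Corollary 5.3. The function names and the values of n, D, and L are hypothetical.

```python
import math

def step_size(n, T, D, L):
    """alpha = D sqrt(n) / (L sqrt(T (n + 2T))), as chosen in the proof of Proposition 5.4."""
    return D * math.sqrt(n) / (L * math.sqrt(T * (n + 2 * T)))

def stability_route_excess(n, T, D, L):
    """Optimization term D^2/(2 T alpha) + L^2 alpha / 2 plus stability term T L^2 alpha / n."""
    alpha = step_size(n, T, D, L)
    opt = D**2 / (2 * T * alpha) + L**2 * alpha / 2
    stab = T * L**2 * alpha / n
    return opt + stab

def single_pass_excess(n, D, L):
    """Excess risk D L / sqrt(n) from Corollary 5.3."""
    return D * L / math.sqrt(n)

def ratio(n, T, D=1.0, L=1.0):
    """How much larger the multi-pass excess risk is than the single-pass one."""
    return stability_route_excess(n, T, D, L) / single_pass_excess(n, D, L)

n = 10_000                              # illustrative sample size
for passes in (1, 10, 1000):
    print(passes, ratio(n, passes * n))
```

The ratio equals $\sqrt{(n + 2T)/T}$, so it starts at $\sqrt{3}$ for one pass and decreases toward $\sqrt{2}$ as the number of passes grows.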


6 Experimental Evaluation 

The goal of our experiments is to isolate the effect of training time, measured in number of steps, on 
the stability of SGM. We evaluated broadly a variety of neural network architectures and varying 
step sizes on a number of different datasets. 

To measure algorithmic stability we consider two proxies. The first is the Euclidean distance
between the parameters of two identical models trained on datasets which differ by a single
example. In all of our proofs, we use slow growth of this parameter distance as a way to prove
stability. Note that it is not necessary for this parameter distance to grow slowly in order for our
models to be algorithmically stable. This is a strictly stronger notion. Our second, weaker proxy
is to measure the generalization error directly in terms of the absolute difference between the test
error and training error of the model.

We analyzed four standard machine learning datasets each with their own corresponding deep
architecture. We studied the LeNet architecture for MNIST, the cuda-convnet architecture for
CIFAR-10, the AlexNet model for ImageNet, and the LSTM model for the Penn Treebank Language
Model (PTB). Full details of our architectures and training procedures can be found below.

In all cases, we ran the following experiment. We choose a random example from the training 
set and remove it. The remaining examples constitute our set S. Then we create a set S' by 
replacing a random element of S with the element we deleted. We train stochastic gradient descent 
with the same random seed on datasets S and S'. We record the Euclidean distance between the 
individual layers in the neural network after every 100 SGM updates. We also record the training 
and testing errors once per epoch. 
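
The protocol can be sketched on a toy problem. The code below is not the experimental code used for the neural networks in this section; it is a minimal illustration that substitutes a one-dimensional linear model with squared loss (an assumption), builds the neighbouring datasets S and S' as described, runs SGD on both with the same seed so that the same indices are sampled, and records the parameter distance every 100 updates. All names and constants are hypothetical.

```python
import random

def make_neighbouring_datasets(data, rng):
    """Remove a random example to get S, then swap it into a random slot of S to get S'."""
    data = list(data)
    held_out = data.pop(rng.randrange(len(data)))
    S = data
    S_prime = list(S)
    swap_index = rng.randrange(len(S))
    S_prime[swap_index] = held_out
    return S, S_prime, swap_index

def sgd_trajectory(dataset, steps, alpha, seed, record_every):
    """SGD on the squared loss of a 1-D linear model y ~ w * x; returns recorded iterates."""
    rng = random.Random(seed)           # same seed => same sequence of sampled indices
    w = 0.0
    recorded = []
    for t in range(1, steps + 1):
        x, y = dataset[rng.randrange(len(dataset))]
        w -= alpha * (w * x - y) * x    # gradient of 0.5 * (w * x - y)^2
        if t % record_every == 0:
            recorded.append(w)
    return recorded

# Synthetic data, for illustration only: y = 2x plus Gaussian noise.
data_rng = random.Random(0)
data = []
for _ in range(201):
    x = data_rng.uniform(-1, 1)
    data.append((x, 2.0 * x + data_rng.gauss(0, 0.1)))

S, S_prime, swap_index = make_neighbouring_datasets(data, data_rng)
run = sgd_trajectory(S, steps=2000, alpha=0.05, seed=1, record_every=100)
run_prime = sgd_trajectory(S_prime, steps=2000, alpha=0.05, seed=1, record_every=100)
distances = [abs(a - b) for a, b in zip(run, run_prime)]
print(distances)
```

For a multi-layer model the same loop would record one Euclidean distance per layer instead of a single absolute difference.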

To varying degrees, our experiments show four primary findings: 

1. Typically, halving the step size roughly halves the generalization error. This behavior is fairly 
consistent for both generalization error defined with respect to classification accuracy and 
cross entropy (the loss function used for training). It thus suggests that there is an intrinsic 
linear dependence on the step size in the generalization error. The linear relationship between 
generalization error and step-size is quite pronounced in the Cifar10 experiments, as shown
in the figures plotting generalization error against epochs for varying step sizes.

2. We evaluate the Euclidean distance between the parameters of two models trained on two 
copies of the data differing in a random substitution. We observe that the parameter distance 
grows sub-linearly even in cases where our theory currently uses an exponential bound. This 
shows that our bounds are pessimistic. 

3. There is a close correspondence between the parameter distance and generalization error. A 
priori, it could have been the case that the generalization error is small even though the 
parameter distance is large. Our experiments show that these two quantities often move in 
tandem and seem to be closely related. 

4. When measuring parameter distance it is indeed important that SGM does not immediately 
encounter the random substitution, but only after some progress in training has occurred. If 
we artificially place the corrupted data point at the first step of SGM, the parameter distance 
can grow significantly faster subsequently. This effect is most pronounced in the ImageNet
experiments, as displayed in the figure comparing early and late substitution on AlexNet.

We evaluated convolutional neural networks for image classification on three datasets: MNIST,
Cifar10 and ImageNet.


6.1 Convolutional neural nets on Cifar 


Starting with Cifar10, we chose a standard model consisting of three convolutional layers each
followed by a pooling operation. This model roughly corresponds to that proposed by Krizhevsky
et al. [19] and available in the “cudaconvnet” code. However, to make the experiments more
interpretable, we avoid all forms of regularization such as weight decay or dropout. We also do not
employ data augmentation even though this would greatly improve the ultimate test accuracy of
the model. Additionally, we use only constant step sizes in our experiments. With these restrictions
the model we use converges to below 20% test error. While this is not state of the art on Cifar10,
our goal is not to optimize test accuracy but rather to obtain a simple, interpretable experimental setup.

Figure 1: Generalization error as a function of the number of epochs for varying step sizes
on Cifar10. Here generalization error is measured with respect to classification accuracy. Left:
20 epochs. Right: 60 epochs.

Figure 2: Generalization error as a function of the number of epochs for varying step sizes on
Cifar10. Here, generalization error is measured with respect to cross entropy as a loss function.
Left: 20 epochs. Right: 60 epochs.

Figure 3: Normalized Euclidean distance between parameters of two models trained under
different random substitutions on Cifar10. Here we show the differences between individual
model layers.

Figure 4: Parameter distance versus generalization error on Cifar10.

Figure 5: Parameter distance and generalization error on MNIST.


6.2 Convolutional neural nets on MNIST 


The situation on MNIST is largely analogous to what we saw on Cifar10. We trained a LeNet
inspired model with two convolutional layers and one fully-connected layer. The first and second
convolutional layers have 20 and 50 hidden units respectively. This model is much smaller and
converges significantly faster than the Cifar10 models, typically achieving best test error in five
epochs. We trained with minibatch size 60. As a result, the amount of overfitting is smaller, as
shown in the figure on parameter distance and generalization error on MNIST.

In the case of MNIST, we also repeated our experiments after replacing the usual cross entropy
objective with a squared loss objective. The results are displayed in the figure on training with a
squared loss objective. It turned out that
this does not harm convergence at all, while leading to somewhat smaller generalization error and
parameter divergence.

6.3 Convolutional neural nets on ImageNet 


On ImageNet, we trained the standard AlexNet architecture [19] using data augmentation, regularization,
and dropout. Unlike in the case of Cifar10, we were unable to find a setting of hyperparameters
that yielded reasonable performance without using these techniques. However, for
the figure on varying model size, we did not use data augmentation, in order to exaggerate the
effects of overfitting and demonstrate the impact of scaling the model size. This figure
demonstrates that the model size appears to be a
second-order effect with regard to generalization error, and step size has a considerably stronger
impact.

^ https://code.google.com/archive/p/cuda-convnet

Figure 6: Training on MNIST with squared loss objective instead of cross entropy. Otherwise
identical experiments as in the previous figure.

Figure 7: Left: Performing a random substitution at the beginning of each epoch on
AlexNet. Right: Random substitution at the end of each epoch. The parameter divergence
is considerably smaller under late substitution.

Figure 8: Left: Generalization error in terms of top 1 precision for varying model size on
Imagenet. Right: The same with top 5 precision.

6.4 Recurrent neural networks with LSTM 

We also examined the stability of recurrent neural networks. Recurrent models have a considerably
different connectivity pattern than their convolutional counterparts. Specifically, we looked at an
LSTM architecture that was used by Zaremba et al. for language modeling. We focused on
word-level prediction experiments using the Penn Tree Bank (PTB), consisting of 929,000
training words, 73,000 validation words, and 82,000 test words. PTB has 10,000 words in its
vocabulary. Following Zaremba et al., we trained regularized LSTMs with two layers that were
unrolled for 20 steps. We initialize the hidden states to zero. We trained with minibatch size
20. The LSTM has 200 units per layer and its parameters are initialized to have mean zero and
standard deviation of 0.1. To enhance reproducibility, we did not use dropout. Dropout would only
increase the stability of our models. The results are displayed in the figure on training on the PTB data set.


7 Future Work and Open Problems 

Our analysis departs from much previous work in that we directly analyze the generalization performance
of an algorithm rather than the solution of an optimization problem. In doing so we build
on the toolkit usually used to prove that algorithms converge in objective value. 

This approach could be more powerful than analyzing optimality conditions, as it may be easier 
to understand how each data point affects a procedure rather than an optimal solution. It also 
has the advantage that the generalization bound holds even if the algorithm fails to find a unique 
optimal solution as is common in non-convex problems. 

In addition to this broader perspective on algorithms for learning, there are many exciting 
theoretical and empirical directions that we intend to pursue in future work. 

High Probability Bounds. The results in this paper are all in expectation. Similar to the well-known
proofs of the stochastic gradient method, deriving bounds on the expected risk is relatively
straightforward, but high probability bounds need more attention and care [27, 35]. In the case
of stability, the standard techniques from Bousquet and Elisseeff require uniform stability on the
order of O(1/n) to apply exponential concentration inequalities like McDiarmid's. For larger
values of the stability parameter $\epsilon_{\mathrm{stab}}$, it is more difficult to construct such high probability bounds.
In our setting, things are further complicated by the fact that our algorithm is itself randomized,
and thus a concentration inequality must be devised to account for both the randomness in the
data and in the training algorithm. Since differential privacy and stability are closely related, one
possibility is to derive concentration via an algorithmic method, similar to the one developed by
Nissim and Stemmer.

^ The data can be accessed at the URL http://www.fit.vutbr.cz/~imikolov/rnnlm/simple-examples.tgz

Figure 9: Training on PTB data set with LSTM architecture.

Stability of the gradient method. Since gradient descent can be considered a “limiting case”
of the stochastic gradient descent method, one can use an argument like our Growth Recursion
Lemma (Lemma 2.5) to analyze its stability. Such an argument provides an estimate of $\epsilon_{\mathrm{stab}} \lesssim \frac{\alpha T}{n}$,
where α is the step size and T is the number of iterations. Generic bounds for convex functions
suggest that gradient descent achieves an optimization error of O(1/T). Thus, a generalization
bound of $O(1/\sqrt{n})$ is achievable, but at a computational complexity of $O(n^{1.5})$. SGM, on the other
hand, achieves a generalization of $O(1/\sqrt{n})$ in time O(n).

In the non-convex case, we are unable to prove any reasonable form of stability at all. In
fact, gradient descent is not uniformly stable as it does not enjoy the “burn-in” period of SGM
as illustrated in Figure 10. Poor generalization behavior of gradient descent has been observed in
practice, but lower bounds for this approach are necessary to rule out a stable implementation for
non-convex machine learning.


Acceleration and momentum. We have described how many of the best practices in neural
net training can be understood as stability inducing operations. One very important technique
that we did not discuss is momentum. In momentum methods, the update is a linear combination
of the current iterate and the previous direction. For convex problems, momentum is known
to decrease the number of iterations required by stochastic gradient descent [22]. For general
nonlinear problems, it is believed to decrease the number of iterations required to achieve low training
error. However, it is not clear that momentum adds stability. Indeed, in the case of convex
optimization, momentum methods are less robust to noise than gradient methods [7, 23]. Thus, it
is possible that momentum speeds up training but adversely impacts generalization.

Figure 10: Left: Gradient descent is not uniformly stable for non-convex functions. Right:
SGM is stable due to “burn-in” period.


Model Selection. Another related avenue that bridges theory and practice is using stability as 
a method for model selection. In particular, our results imply that the models that train the fastest 
also generalize the best. This suggests that a heuristic for model selection would be to run many 
different parameter settings and choose the model which results in the lowest training error most 
quickly. This idea is relatively simple to try in practice, and ideas from bandit optimization can 
be applied to efficiently search with this heuristic cost [15]. From the theoretical perspective,
understanding the sensibility of this heuristic would require understanding lower bounds for generalizability.
Are there necessary conditions which state that models which take a long training time
by SGM generalize less well than those with short training times? 


High capacity models that train quickly. If the models can be trained quickly via stochastic 
gradient, our results prove that these models will generalize. However, this manuscript provides no 
guidance as to how to build a model where training is stable and training error is low. Designing a 
family of models which both has high capacity and can be trained quickly would be of significant 
theoretical and practical interest. 

Indeed, the capacity of models trained in current practice steadily increases as growing computational
power makes it possible to effectively train larger models. It is not uncommon for some
models, such as large neural networks, to have more free parameters than the size of the sample
yet have rather small generalization error [19, 41]. In fact, sometimes increasing the model capacity
even seems to decrease the generalization error [31]. Is it possible to understand this phenomenon
via stability? How can we find models which provably both have high capacity and train quickly?

Algorithm Design. Finally, we note that stability may also provide new ideas for designing
learning rules. There are a variety of successful methods in machine learning and signal processing
that do not compute an exact stochastic gradient, yet are known to find quality stationary points
in theory and practice. Do the ideas developed in this paper provide new insights into how to
design learning rules that accelerate the convergence and improve the generalization of SGM?

A Elementary properties of convex functions

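Proof of Lemma 3.7.1. Let $G = G_{f,\alpha}$. By triangle inequality and our smoothness assumption,

$$\begin{aligned}
\|G(v) - G(w)\| &\le \|v - w\| + \alpha\|\nabla f(w) - \nabla f(v)\| \\
&\le \|v - w\| + \alpha\beta\|w - v\| \\
&= (1 + \alpha\beta)\|v - w\| .
\end{aligned}$$

□

Proof of Lemma 3.7.2. Convexity and β-smoothness imply that the gradients are co-coercive,
namely

$$\langle \nabla f(v) - \nabla f(w), v - w \rangle \ge \frac{1}{\beta}\|\nabla f(v) - \nabla f(w)\|^2 . \qquad (A.1)$$

We conclude that

$$\begin{aligned}
\|G_{f,\alpha}(v) - G_{f,\alpha}(w)\|^2 &= \|v - w\|^2 - 2\alpha\langle \nabla f(v) - \nabla f(w), v - w \rangle + \alpha^2\|\nabla f(v) - \nabla f(w)\|^2 \\
&\le \|v - w\|^2 - \left(\frac{2\alpha}{\beta} - \alpha^2\right)\|\nabla f(v) - \nabla f(w)\|^2 \\
&\le \|v - w\|^2 .
\end{aligned}$$

□

Proof of Lemma 3.7.4. First, note that if f is γ-strongly convex, then $\varphi(w) = f(w) - \frac{\gamma}{2}\|w\|^2$ is
convex and (β − γ)-smooth. Hence, applying (A.1) to φ yields the inequality

$$\langle \nabla f(v) - \nabla f(w), v - w \rangle \ge \frac{\beta\gamma}{\beta + \gamma}\|v - w\|^2 + \frac{1}{\beta + \gamma}\|\nabla f(v) - \nabla f(w)\|^2 .$$

Using this inequality gives

$$\begin{aligned}
\|G_{f,\alpha}(v) - G_{f,\alpha}(w)\|^2 &= \|v - w\|^2 - 2\alpha\langle \nabla f(v) - \nabla f(w), v - w \rangle + \alpha^2\|\nabla f(v) - \nabla f(w)\|^2 \\
&\le \left(1 - 2\frac{\alpha\beta\gamma}{\beta + \gamma}\right)\|v - w\|^2 - \alpha\left(\frac{2}{\beta + \gamma} - \alpha\right)\|\nabla f(v) - \nabla f(w)\|^2 .
\end{aligned}$$

With our assumption that $\alpha \le \frac{2}{\beta + \gamma}$, this implies

$$\|G_{f,\alpha}(v) - G_{f,\alpha}(w)\| \le \left(1 - 2\frac{\alpha\beta\gamma}{\beta + \gamma}\right)^{1/2}\|v - w\| .$$

The lemma follows by applying the inequality $\sqrt{1 - x} \le 1 - x/2$, which holds for $x \in [0, 1]$. □

Proof of Lemma 4.6. This proof is due to Rockafellar [36]. Define

$$P_\nu(w) = \arg\min_v \; \frac{1}{2\nu}\|w - v\|^2 + f(v) . \qquad (A.2)$$

This is the proximal mapping associated with f. Define the map $Q_\nu(w) := w - P_\nu(w)$. Then, by
the optimality conditions associated with (A.2), we have

$$\nu^{-1} Q_\nu(w) \in \partial f(P_\nu(w)) .$$

By convexity of f, we then have

$$\langle P_\nu(v) - P_\nu(w), Q_\nu(v) - Q_\nu(w) \rangle \ge 0 .$$

Using this inequality, we have

$$\begin{aligned}
\|v - w\|^2 &= \|[P_\nu(v) - P_\nu(w)] + [Q_\nu(v) - Q_\nu(w)]\|^2 \\
&= \|P_\nu(v) - P_\nu(w)\|^2 + 2\langle P_\nu(v) - P_\nu(w), Q_\nu(v) - Q_\nu(w) \rangle + \|Q_\nu(v) - Q_\nu(w)\|^2 \\
&\ge \|P_\nu(v) - P_\nu(w)\|^2 ,
\end{aligned}$$

thus completing the proof. □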